Introduction

This report describes an algorithm for predicting the quality of red wine from the Vinho Verde region of Portugal. The data for this project were obtained from the UCI Machine Learning Repository [1].

Project Design

The data set consists of 1599 samples of wine and 12 variables. I partitioned the data into two sets: a training set of 1200 samples and a test set of 399 samples.

## [1] 1200   12
## [1] 399  12
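The partition can be sketched as follows (a Python sketch with a random stand-in array; the report's own analysis was done in R, and the real values come from the UCI data set):

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in for the 1599 x 12 red-wine table (11 predictors plus quality).
data = rng.normal(size=(1599, 12))

# Shuffle the row indices, then take 1200 rows for training and 399 for testing.
idx = rng.permutation(len(data))
train_set, test_set = data[idx[:1200]], data[idx[1200:]]

print(train_set.shape)  # (1200, 12)
print(test_set.shape)   # (399, 12)
```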

Exploratory Analysis

The purpose of the exploratory analysis is twofold. On the one hand, it should narrow the number of variables to be included in the model, improving computational efficiency. On the other hand, it should identify the main features of wine quality. The first task involves statistical techniques, while the second draws on knowledge of the wine-making process.

One technique used to reduce the number of variables is Principal Component Analysis (PCA). PCA can be used to “join” two variables that are linearly correlated, thus reducing noise without sacrificing predictive power.

Along those lines, I applied the correlation function (cor) in R to all variables in the data set except the quality variable, and searched for pairs of variables with a correlation of at least 0.8 on which to apply PCA. Unfortunately, no such correlation was found, so I discarded preprocessing with PCA.

## integer(0)
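The screening step can be sketched as follows (a Python sketch; independent synthetic columns stand in for the 11 wine predictors, so the search comes back empty, mirroring the `integer(0)` result above):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the 11 predictor columns (quality excluded).
X = rng.normal(size=(1599, 11))

# Pairwise correlation matrix of the predictors.
corr = np.corrcoef(X, rowvar=False)

# Distinct pairs (i < j) whose absolute correlation reaches 0.8.
i, j = np.where(np.abs(np.triu(corr, k=1)) >= 0.8)
pairs = list(zip(i.tolist(), j.tolist()))
print(pairs)  # [] -- no strongly correlated pair, so PCA has nothing to join
```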

Given that PCA was ruled out by the results of the correlation function, I used the least absolute shrinkage and selection operator (LASSO). LASSO shrinks the coefficients of the predictors that contribute least to explaining the quality variable to zero, thus narrowing the number of predictors (variables).

[Figure: LASSO variable selection]

The image above shows the variables selected with the LASSO method: volatile acidity, pH, sulphates, and alcohol. The statistical techniques employed have thus narrowed the possible number of predictors from 11 to 4. Nonetheless, an analysis of the wine-making process is necessary to determine whether the correlations found with the LASSO method are due to chance or whether there is a cause-effect relation between the variables that explains them. Accordingly, I explore in more detail how wine quality is evaluated.
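LASSO's selection behavior can be sketched on synthetic data (a Python sketch with scikit-learn; the four informative columns here are arbitrary stand-ins, not the actual wine variables):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)

# 11 predictors, of which only four actually drive the response.
X = rng.normal(size=(500, 11))
beta = np.zeros(11)
beta[[0, 3, 7, 10]] = [1.5, -2.0, 1.0, 2.5]
y = X @ beta + rng.normal(scale=0.5, size=500)

# The L1 penalty shrinks the coefficients of uninformative predictors to zero.
model = Lasso(alpha=0.1).fit(X, y)
selected = [k for k, c in enumerate(model.coef_) if abs(c) > 1e-6]
print(selected)  # the four informative columns survive
```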

Measuring wine quality

Wine quality is established through wine-tasting procedures designed to identify defects in wine. The process involves feedback from experts: a group of experts, or sommeliers, receives a scorecard, such as the Davis 20-point scorecard. The wine is then given to the sommeliers, who evaluate it by assigning a score according to the parameters on the scorecard: the higher the score, the better the wine.

[Figure: Davis 20-point scorecard] [2]

The idea behind the Davis 20-point scorecard is to have a single set of features to be evaluated by wine experts. The scorecard helps the judges focus on particular aspects of the wine, so that the winemaker can improve its quality.

Nonetheless, experts do not always follow the scores on the card; instead, they often establish scores of their own [3]. The reason is that wine experts value certain features more than others. As a result, a wine may be considered high quality by one expert but low quality by another.

Despite this difficulty, the scorecard is valuable: although there may be disagreement on the amount of sugar necessary for a wine to be ranked as high quality, there is no disagreement on whether sugar is a factor to be considered in evaluating wine quality. Thus, the information provided by the scorecard, in conjunction with the results of the LASSO method, may give us the best predictors for wine quality.

Volatile acidity

Given that only volatile acidity is present both on the scorecard and in the LASSO analysis, an exploration of the relationship between volatile acidity and wine quality seems a good place to start. “Volatile acidity (VA) is a measure of the wine’s volatile (or gaseous) acids. The primary volatile acid in wine is acetic acid, which is also the primary acid associated with the smell and taste of vinegar” [4].

Higher concentrations of VA, in particular of acetic acid, may lead to a vinegar taste and are therefore considered an indicator of spoilage. Moreover, given the importance of VA in wine, governments have put regulations in place establishing the amount of VA allowed in order for a wine to be commercialized [5].

Given this information, I conjecture that a pattern should emerge when plotting VA against the quality variable.

In the image above, a clear relation between quality and VA can be observed: the lowest-quality wines (3) have the highest concentrations of VA, while the highest-quality wines (8) have the least. Thus, VA seems to be a good predictor of wine quality.

So far, the statistical analysis, along with the information provided by the Davis 20-point scorecard, has delivered one good candidate predictor. Nonetheless, not much can be said with respect to the other variables of interest, so a deeper exploration of the wine-making process may be necessary to select predictors.

Sulphates

I found that a certain amount of VA is always to be expected in wine. Therefore, methods such as the use of sulphates have been developed to reduce its concentration [6]. Accordingly, a pattern should be expected when plotting the sulphates variable from the data set against quality.

The plot above shows that higher concentrations of sulphates are found in the best-quality wines (8) than in the lowest-quality ones (3), suggesting that sulphates were indeed used to reduce the amount of VA.

Thus, the inclusion of sulphates as a predictor can be justified on two grounds. On the one hand, there is a statistical correlation between sulphates and the quality variable, shown by the LASSO model. On the other hand, there is a connection between sulphates and VA, which explains their role as a predictor of quality.

pH

Although the presence of acetic acid must be reduced in wine, higher concentrations of other acids, such as tartaric and malic acid, found naturally in grapes, are beneficial to the wine-making process.

This is because one of the causes of spoilage is oxidation, which can be prevented with acids. Acidity levels are measured using the pH scale, which runs from zero to fourteen: values closer to zero represent higher acidity, and values closer to fourteen lower acidity.

[Figure: the pH scale] [7]

Typically, wines have an acidity range between 3 and 4 on the pH scale. Thus, higher pH values indicate lower acidity levels, and therefore an increased risk of oxidation. Winemakers add sulfuric acid to wine in order to decrease the pH value and thereby reduce the risk of oxidation.

Accordingly, a plot of the pH variable against quality should show a decreasing relation: lower pH values should be present in the best-quality wines, and higher ones in low-quality wines.

As predicted, the plot shows a decreasing relation between pH values and quality. As with VA and sulphates, the statistical argument and the information about the wine-making process both support the inclusion of the pH variable in the model.

Alcohol

Though sulfuric acid can be effective in lowering pH values, care has to be taken with the amount added to the wine: an excess of it may lead to a rotten-egg or overcooked-cabbage flavor. Therefore, in order to reach appropriate levels of acidity without compromising the quality of the wine, other factors, such as alcohol, have to be taken into account.

Given that alcohol reduces the risk of oxidation by decreasing pH values, one may expect an increase in wine quality with increasing alcohol levels when plotting the quality variable against the alcohol one.

As anticipated, an increasing relation between alcohol content and quality can be observed in the plot above. That is, higher concentrations of alcohol are present in the best-quality wines, while lower levels are found in the lower-quality ones.

Given the arguments above, I built a predictive algorithm based on four variables: volatile acidity, sulphates, pH, and alcohol.

Training the Model

Given that no linear relation was found between the variables in the training set, I discarded any model based on linear regression. The fact that the variable to be predicted has six different outcomes led me to conjecture that models based on Random Forests would be more suitable for the prediction.

Accordingly, I trained two models, one based on the Random Forest method and the other on Boosting with Random Forests.
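The two fits can be sketched as follows (a Python sketch with scikit-learn standing in for the R tooling used in the report; random data replaces the four wine predictors, and GradientBoostingClassifier stands in for the boosting method):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

rng = np.random.default_rng(2)

# Stand-ins for the four selected predictors and the six quality grades (3-8).
X = rng.normal(size=(1200, 4))
y = rng.integers(3, 9, size=1200)

# Random Forest grows each tree to purity, so it memorizes the training set.
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Boosting combines shallow trees, usually fitting the training set less tightly.
gbm = GradientBoostingClassifier(random_state=0).fit(X, y)

print(rf.score(X, y))   # in-sample accuracy, typically 1.0
print(gbm.score(X, y))  # usually lower in-sample accuracy
```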

In-Sample Error

I made two predictions on the training set to obtain the in-sample error of each model: one using the model trained with the Random Forest method, and one with the Boosting with Random Forest method. The Random Forest method achieved an accuracy rate of 1, while the Boosting method achieved 0.7596. I used the confusionMatrix function to calculate both values. (A perfect in-sample accuracy is expected for Random Forests, whose trees can fit the training data exactly; the meaningful measure is the out-of-sample error.)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   3   4   5   6   7   8
##          3   8   0   0   0   0   0
##          4   0  41   0   0   0   0
##          5   0   0 509   0   0   0
##          6   0   0   0 479   0   0
##          7   0   0   0   0 153   0
##          8   0   0   0   0   0  10
## 
## Overall Statistics
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9969, 1)
##     No Information Rate : 0.4242     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
## 
## Statistics by Class:
## 
##                      Class: 3 Class: 4 Class: 5 Class: 6 Class: 7 Class: 8
## Sensitivity          1.000000  1.00000   1.0000   1.0000   1.0000 1.000000
## Specificity          1.000000  1.00000   1.0000   1.0000   1.0000 1.000000
## Pos Pred Value       1.000000  1.00000   1.0000   1.0000   1.0000 1.000000
## Neg Pred Value       1.000000  1.00000   1.0000   1.0000   1.0000 1.000000
## Prevalence           0.006667  0.03417   0.4242   0.3992   0.1275 0.008333
## Detection Rate       0.006667  0.03417   0.4242   0.3992   0.1275 0.008333
## Detection Prevalence 0.006667  0.03417   0.4242   0.3992   0.1275 0.008333
## Balanced Accuracy    1.000000  1.00000   1.0000   1.0000   1.0000 1.000000

Since the accuracy of at least one of the models was at least 0.8, I proceeded to make a prediction on the test set to estimate the out-of-sample error using the Random Forest model.

Test Set and Out-of-Sample Error

The model achieved an accuracy rate of 0.6842; again, the confusionMatrix function was used to calculate the values.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   3   4   5   6   7   8
##          3   0   0   0   0   0   0
##          4   0   1   1   0   0   0
##          5   2   6 134  38   4   0
##          6   0   5  33 111  16   1
##          7   0   0   4  10  26   6
##          8   0   0   0   0   0   1
## 
## Overall Statistics
##                                           
##                Accuracy : 0.6842          
##                  95% CI : (0.6361, 0.7296)
##     No Information Rate : 0.4311          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4922          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 3 Class: 4 Class: 5 Class: 6 Class: 7 Class: 8
## Sensitivity          0.000000 0.083333   0.7791   0.6981  0.56522 0.125000
## Specificity          1.000000 0.997416   0.7797   0.7708  0.94334 1.000000
## Pos Pred Value            NaN 0.500000   0.7283   0.6687  0.56522 1.000000
## Neg Pred Value       0.994987 0.972292   0.8233   0.7940  0.94334 0.982412
## Prevalence           0.005013 0.030075   0.4311   0.3985  0.11529 0.020050
## Detection Rate       0.000000 0.002506   0.3358   0.2782  0.06516 0.002506
## Detection Prevalence 0.000000 0.005013   0.4612   0.4160  0.11529 0.002506
## Balanced Accuracy    0.500000 0.540375   0.7794   0.7345  0.75428 0.562500
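As a check, the reported accuracy can be recomputed from the printed confusion matrix (a Python sketch; accuracy is simply the diagonal count over the total):

```python
import numpy as np

# Test-set confusion matrix from above (rows: predicted, columns: reference).
cm = np.array([
    [0, 0,   0,   0,  0, 0],
    [0, 1,   1,   0,  0, 0],
    [2, 6, 134,  38,  4, 0],
    [0, 5,  33, 111, 16, 1],
    [0, 0,   4,  10, 26, 6],
    [0, 0,   0,   0,  0, 1],
])

# Correct predictions sit on the diagonal; divide by the 399 test samples.
accuracy = np.trace(cm) / cm.sum()
print(round(accuracy, 4))  # 0.6842
```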

As expected, the accuracy rate of the Random Forest model decreased when exposed to new data. Nonetheless, it remains above the 0.5 threshold, an indication of the viability of the model.

Conclusion

Paulo Cortez, along with other researchers [8], created an algorithm based on SVM (support vector machines) for predicting the quality of wine using the same data set I use [9]. Their algorithm achieved an accuracy rate of 89.9% [10]. In contrast, my algorithm based on the Random Forest method reached an accuracy rate of 68.4%. Based only on the accuracy rate, one may argue that their algorithm performs better than mine.

Nonetheless, on the one hand, Cortez et al. use all eleven variables in the data set as predictors, while I employ only four. Thus, their algorithm may require more memory and computation (scalability), increasing the time needed to calculate a prediction. As a result, the algorithm may be harder to implement [11].

On the other hand, my algorithm offers a small set of variables and value ranges that wine-makers can use to improve the quality of their wines. Although my algorithm is less accurate than that of Cortez et al., it can be more useful for practical applications.

There is always a trade-off between accuracy and interpretability in the construction of an algorithm. One can have a very accurate algorithm that is nonetheless very hard to interpret. This matters because interpretability answers the question of why certain variables are good predictors while others are not.

This trade-off between accuracy and interpretability lies at the heart of the distinction between machine learning and statistical learning. The former puts more focus on the side of predictability, while the latter on interpretability and statistical inference. My analysis falls into the latter.

Thus, the value of an algorithm lies in its purpose. On the one hand, if the objective is interpretability, then it makes sense to sacrifice some accuracy (as I did). On the other hand, if the goal is predictability, then one may prefer to lose interpretability (as Cortez et al. did) [12].

Bibliography

[1] See UCI Machine Learning Repository.

[2] See Ebeler, Susan E. Linking Flavor Chemistry to Sensory Analysis of Wine. In Flavor Chemistry: Thirty Years of Progress. Springer Science+Business Media, New York, 1999, p. 410.

[3] See Noble, A. C. Analysis of Wine Sensory Properties. In Wine Analysis. Springer-Verlag, Berlin, 1988, p. 22.

[4] See Volatile Acidity in Wine.

[5] See Legal Information Institute.

[6] See Mazzeo, Jacopo. What Does ‘Volatile Acidity’ Mean in Wine?

[7] See Hale, Noelle. What Is Acidity in Wine?

[8] See P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, Elsevier, 47(4):547-553, 2009.

[9] I focused on the samples of red wine, while Cortez et al. analyzed both the red and white samples.

[10] See P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, Elsevier, 47(4):547-553, 2009, p. 550.

[11] For instance, Netflix decided not to use the algorithm that won its million-dollar prize competition for improving recommendation accuracy, due to scalability issues.

[12] It is also worth noting that the prediction accuracy for qualities 3, 4, and 8 was very low both in my algorithm and in the SVM-based one. This suggests that it is not the type of algorithm but rather the small number of samples in those classes that explains the low accuracy rates, since both algorithms perform better in the classes with more data points.